Speech Emotion Recognition (SER) is the task of recognizing the emotional content of speech independently of its semantic content. The ability to express emotions is one of the defining aspects of sentient beings. While humans can efficiently read non-verbal cues, even in the middle of a conversation, the ability of computers to do the same has been an ongoing subject of research. Robots capable of understanding emotions could provide appropriate emotional responses and exhibit emotional personalities. In some circumstances, humans could be replaced by computer-generated characters able to conduct natural and convincing conversations by appealing to human emotions. Machines therefore need to understand the emotions conveyed by speech; only with this capability can an entirely meaningful dialogue, based on mutual human-machine trust and understanding, be achieved. Emotion recognition systems aim to provide efficient, real-time methods of detecting the emotions of mobile phone users, call center operators and customers, car drivers, pilots, and many other users of human-machine interfaces. Adding emotions to machines has been recognized as a critical factor in making machines appear and act human-like.
Human emotions can be detected through various channels, such as speech, body language, facial expressions, and text. The most obvious channel is speech. Just as body language and text can be studied to gauge user sentiment, speech signals carry a wealth of information related to emotional state. While this is an easy task for humans, computers still have a long way to go before emotion recognition becomes a mature form of artificial intelligence. The biggest impediment is that no single discrete speech feature directly reflects the speaker's emotions. An added challenge is the limited amount of training data, which leads to low prediction accuracy. Understanding human emotions is critical in areas such as psychology, criminology, banking, and insurance, predominantly to aid appropriate corrective action in cases of crime, fraud, etc. In these contexts, emotions in speech can be used to infer various facets of human behaviour irrespective of language, ethnicity, and other distinguishing factors. A model that can identify emotions from speech can be integrated into such platforms to aid decision making.
The data consists of audio files, and features can be extracted from them in multiple ways. The sound excerpts are digital audio files in .wav format. Sampling is the process of digitizing a continuous sound wave into a series of discrete values.
Since the audio files are saved in .wav format, they are easy to load with Librosa or a comparable library such as Torchaudio.
The speech files, which are in .wav format, are digitized using the Librosa/Torchaudio library. The digitized samples are trimmed to remove leading and trailing silences, and zero-padded to a consistent length for further processing. The signal is then transformed to the frequency domain using the short-term Fourier transform, which makes the audio amenable to feature extraction.
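The trimming and padding step can be sketched as follows. This is a simplified stand-in, not the pipeline's actual implementation: the amplitude threshold is an illustrative choice, whereas the actual pre-processing uses librosa's dB-based trimming (`top_db`).

```python
import numpy as np

def trim_and_pad(signal, target_len, threshold=1e-3):
    """Trim leading/trailing near-silence, then zero-pad to a fixed length.

    Simplified stand-in for librosa.effects.trim: 'threshold' is an
    illustrative amplitude cutoff, not librosa's dB-based top_db.
    """
    voiced = np.flatnonzero(np.abs(signal) > threshold)
    if voiced.size:
        signal = signal[voiced[0]:voiced[-1] + 1]
    if len(signal) >= target_len:
        return signal[:target_len]
    # trailing zero padding to make all samples the same length
    return np.pad(signal, (0, target_len - len(signal)))

x = np.array([0.0, 0.0, 0.5, -0.4, 0.2, 0.0])
y = trim_and_pad(x, target_len=5)   # -> [0.5, -0.4, 0.2, 0.0, 0.0]
```

The padded, fixed-length signals can then be batched for the Fourier transform and feature extraction.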
The Mel-spectrogram is a visual representation of the temporal changes in the energy of different frequency bands. Humans do not perceive frequencies on a linear scale; the Mel scale compensates by adopting a non-linear scale on which equal distances between frequencies sound equally different to the human ear. These features lend themselves well to deep learning models. The Mel-spectrogram of an audio file from the RAVDESS database after the pre-processing steps is shown in Fig. 2.1.a.
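The non-linearity of the Mel scale can be seen from one common (HTK-style) Hz-to-Mel conversion formula; libraries offer several variants, so treat this as illustrative rather than the exact mapping used by the pipeline:

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style Mel formula: equal steps on the Mel axis approximate
    # perceptually equal pitch steps for human listeners.
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# Equal 1000 Hz steps shrink on the Mel axis as frequency grows,
# reflecting coarser human pitch resolution at high frequencies:
gaps = np.diff(hz_to_mel([0, 1000, 2000, 3000]))
```

Each successive 1000 Hz interval maps to a smaller Mel interval, which is exactly why high-frequency bands are merged more aggressively in a Mel filterbank.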

| Fig [2.1.a] |
In the context of speech, MFCCs are the most commonly used features and contain the "voice fingerprint" of the speaker. While the proposed methodology explores MFCCs as features for emotion recognition, it also considers Mel-spectrograms as features.
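By definition, MFCCs are obtained by taking a discrete cosine transform over the log-Mel bands of each frame. The toy implementation below (unscaled DCT-II, hypothetical function name) shows that step in isolation; library MFCC routines differ in normalization and scaling:

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """Toy MFCC step: type-II DCT over the mel-band axis of a
    log-Mel spectrogram (n_mels x frames), keeping the first
    n_mfcc coefficients. Library implementations add scaling."""
    n_mels = log_mel.shape[0]
    n = np.arange(n_mels)
    # DCT-II basis: rows are cosines of increasing frequency
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc),
                                    (2 * n + 1) / (2.0 * n_mels)))
    return basis @ log_mel

log_mel = np.random.default_rng(0).standard_normal((128, 40))
mfcc = mfcc_from_log_mel(log_mel)   # shape (13, 40)
```

Keeping only the first few coefficients discards fine spectral detail and retains the smooth envelope — the "voice fingerprint".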

| Fig [2.1.b] |
A feature extractor is in charge of preparing input features for a multi-modal model. This includes feature extraction from sequences, e.g., pre-processing audio files into log-Mel spectrogram features, and feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy, PyTorch, and TensorFlow tensors.
In Hugging Face Transformers, the Wav2Vec2 model is thus accompanied by both a tokenizer, called Wav2Vec2CTCTokenizer, and a feature extractor, called Wav2Vec2FeatureExtractor.
An audio file usually stores both its sample values and the sampling rate with which the speech signal was digitized. We want to store both in the dataset and write a map(...) function accordingly.

The pretrained Wav2Vec2 checkpoint maps the speech signal to a sequence of context representations as illustrated in the figure above. A fine-tuned Wav2Vec2 checkpoint needs to map this sequence of context representations to its corresponding transcription, so a linear layer has to be added on top of the transformer block (shown in yellow).
The most popular speech dataset considered (RAVDESS) has a total of 1440 data points covering a set of 8 emotions. The data was recorded in a noise-free studio environment with speakers having a North American accent. The other dataset, CREMA-D, is another English-language corpus spoken by a larger demographic of speakers with a variety of accents and has 7442 data points in total. The datasets available for building a robust speech emotion recognition system are far too small for state-of-the-art deep learning models: modern neural networks have parameters on the order of millions, and obtaining reliable performance requires a proportionally large amount of data. Therefore, augmentation techniques are applied before training a deep learning model, and the augmented data is used for analysis and processing. In a spectrogram, the x-axis represents time and the y-axis represents frequency. Of the many augmentation techniques available, we focus on the following and analyse their effect:
Speech signals are time-variant in nature, and to extract information from a signal it is broken down into shorter temporal segments.
The data is sampled at 8000 Hz, and the audio files are approximately 3 seconds long (~1.5 seconds after snipping the leading and trailing silence). The frame size of each window is set to 0.5 s, and the hop length is 0.25 s.
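This framing step can be sketched as below, using the 8000 Hz rate and the 0.5 s / 0.25 s values from the text; dropping the ragged last frame is an assumption, since handling of the final partial window is not specified:

```python
import numpy as np

SR = 8000                 # sampling rate used in this work
FRAME = int(0.5 * SR)     # 0.5 s window -> 4000 samples
HOP = int(0.25 * SR)      # 0.25 s hop   -> 2000 samples

def frame_signal(x, frame=FRAME, hop=HOP):
    # Split a 1-D signal (len(x) >= frame) into overlapping frames,
    # dropping any ragged tail shorter than one frame.
    n = 1 + max(0, (len(x) - frame) // hop)
    return np.stack([x[i * hop: i * hop + frame] for i in range(n)])

x = np.zeros(SR * 3 // 2)   # ~1.5 s of (silent) audio after trimming
frames = frame_signal(x)    # 5 frames of 4000 samples each
```

With a 1.5 s clip this yields 5 overlapping frames, each covering half a second of audio.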

| Fig [3.1.a] Original Spectrogram |
| Fig [3.1.b] Example of windowing a signal with a 1 s frame size and a 0.25 s hop. |
Case 2: The frame size of each window is set to ~0.5 s
| Fig [3.1.c] Example of windowing a signal with a 0.5 s frame size and a 0.25 s hop. |
The paper discusses three representative data augmentation techniques:
1) Frequency masking: f consecutive frequency channels [f0, f0 + f) are masked. f is chosen from a uniform distribution from 0 to the frequency mask parameter F, and f0 is chosen from [0, ν − f), where ν is the number of frequency channels.
| Fig [3.2.a] Example of applying frequency masking to a spectrogram |
2) Time masking: t consecutive time steps [t0, t0 + t) are masked. t is chosen from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ − t), where τ is the number of time steps.
| Fig [3.2.b] Example of applying time masking to a spectrogram |
3) Time warping: A random point is chosen and warped to either the left or the right by a distance w, chosen from a uniform distribution from 0 to the time warp parameter W along that line. As the paper reports that time warping did not improve model performance, the experiments conducted consider only frequency and time masking.
4) Combining frequency and time masking
| Fig [3.2.c] Example of applying combined frequency and time masking to a spectrogram |
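The two masking operations can be sketched together with NumPy. The mask parameters F and T below, and the mask fill value of zero, are illustrative choices (SpecAugment fills with the log-Mel mean), not the values used in the experiments:

```python
import numpy as np

def mask_spectrogram(spec, F=15, T=20, rng=None):
    """SpecAugment-style masking: zero out one random frequency band
    of height f ~ U[0, F) and one random time span of width t ~ U[0, T).
    F, T, and the zero fill value are illustrative choices."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    nu, tau = out.shape               # frequency channels, time steps
    f = rng.integers(0, F)
    f0 = rng.integers(0, nu - f)
    out[f0:f0 + f, :] = 0.0           # frequency mask
    t = rng.integers(0, T)
    t0 = rng.integers(0, tau - t)
    out[:, t0:t0 + t] = 0.0           # time mask
    return out

spec = np.ones((128, 100))            # stand-in for a Mel-spectrogram
aug = mask_spectrogram(spec, rng=np.random.default_rng(1))
```

Because the masking is applied to the spectrogram rather than the waveform, it is cheap enough to run on the fly during training.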
The models were trained in Jupyter notebooks with an Anaconda Python environment. Reproducing the results requires installing the necessary packages and training the models.
The Anaconda Python environment in which the models were trained is exported to a YML file. The environment can be set up on a new computer by running the following command:
conda env create -f environment.yml
Running this command installs the transformers_4p8_env environment in the default conda environment path.
If you want to specify an install path different from your system default (not related to the 'prefix' in environment.yml), use the -p flag followed by the required path:
conda env create -f environment.yml -p /home/user/anaconda3/envs/env_name
The emotion corpora that have been annotated and evaluated to date are:
Dataset location: _EmotionData/Datasets/RAVDESS
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is a widely used database for emotion recognition. The dataset contains recordings of 24 professional actors (12 female, 12 male) who vocalize two lexically matched statements in a set of 8 emotions (anger, disgust, fear, happiness, surprise, sadness, neutral, and calm).
All speakers in this database have a North American accent. The modalities covered are: audio (16-bit, 48 kHz .wav), audio-video (720p H.264, AAC 48 kHz, .mp4), and video only. There are 1440 stimuli in total. The file naming conventions used for pre-processing the data are referenced in [xx].
In the experiments, only the audio speech data is used to test the performance of the model as the aim is to acquire emotions from spoken audio.

| Fig [4.a] | Fig [4.b] |
Figures [4.1.a]–[4.1.d] show the distribution of the RAVDESS audio data. Fig [4.1.a] shows the distribution of emotions in the RAVDESS data. There are 8 emotions in total; all emotions except neutral are balanced in the dataset. Fig [4.1.b] shows the distribution of actor gender in the RAVDESS dataset, with male and female samples being balanced.

| Fig [4.1.c] | Fig [4.1.d] |
Fig [4.1.c] shows the distribution of emotion intensity; there are fewer strong-intensity samples in this dataset because the neutral emotion has no strong-intensity recordings. Fig [4.1.d] shows the distribution of the two statements used throughout the RAVDESS data, which is balanced.
import os
import numpy as np
import librosa, librosa.display
import IPython.display as ipd

def melspectrogram(audio, sample_rate):
    # Compute a log-scaled (dB) Mel-spectrogram of the audio signal.
    mel_spec = librosa.feature.melspectrogram(y=audio,
                                              sr=sample_rate,
                                              n_fft=1024,
                                              win_length=512,
                                              window='hamming',
                                              hop_length=100,
                                              n_mels=128,
                                              fmax=sample_rate / 2)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    return mel_spec_db

def load_audio(aud_loc, SR):
    # Load the file at the target sampling rate, trim leading/trailing
    # silence, and display its Mel-spectrogram.
    audio, sample_rate = librosa.load(aud_loc, offset=0, sr=SR)
    trimmed, idx_trimmed = librosa.effects.trim(audio, top_db=30)
    mel_spectrogram = melspectrogram(trimmed, SR)
    librosa.display.specshow(mel_spectrogram, y_axis='mel', x_axis='time')
    print('MEL spectrogram shape: ', mel_spectrogram.shape)
    print('The original audio can be found at:\n {}'.format(aud_loc))
    return ipd.Audio(trimmed, rate=SR)

audio_loc = 'Emotion_Data/Datasets/RAVDESS/Audio_Speech_Actors_01-24/Actor_01/03-01-01-01-01-02-01.wav'
load_audio(audio_loc, SR=8000)
File naming convention for RAVDESS dataset is:
Each of the RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:
Filename identifiers :
Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
Filename example: 02-01-06-01-02-01-12.mp4
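Given this convention, the actor field and gender can be recovered from a filename. The helper below is a hypothetical sketch that interprets only the seventh field (actor ID, odd = male, even = female):

```python
def ravdess_actor_info(filename):
    """Parse the 7-part RAVDESS identifier; only the 7th field (actor)
    is interpreted here: odd IDs are male, even IDs are female."""
    parts = filename.rsplit('/', 1)[-1].split('.')[0].split('-')
    assert len(parts) == 7, 'expected a 7-part identifier'
    actor = int(parts[6])
    gender = 'male' if actor % 2 else 'female'
    return actor, gender

actor, gender = ravdess_actor_info('02-01-06-01-02-01-12.mp4')
# -> (12, 'female')
```

The remaining six fields encode the other stimulus characteristics and can be parsed the same way once their code tables are at hand.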
Dataset location: _EmotionData/Datasets/TESS
The Toronto Emotional Speech Set (TESS) consists of the phrase "Say the word" followed by a set of 200 target words, spoken by two female actors aged 24 and 64 years. The recordings portray each of 7 emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are 2800 stimuli in total.
In the experiments, both the audio speech and the transcripts are used to test the performance of the model, as the aim is to acquire emotions either individually or through fusion (audio and text).

| Fig [4.2.a] | Fig [4.2.b] |
audio_loc= 'Emotion_Data/Datasets/TESS/OAF_Fear/OAF_back_fear.wav'
load_audio(audio_loc,SR=8000)
The data naming convention for TESS is straightforward. The files are organised by emotion, as seen below in Fig [4.2.c].
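Since the emotion is the final underscore-separated token of the filename (as in OAF_back_fear.wav), a small hypothetical helper can recover the label directly from the path:

```python
def tess_emotion(filename):
    # TESS files look like OAF_back_fear.wav: speaker code, target word,
    # then the emotion label as the final underscore-separated token.
    stem = filename.rsplit('/', 1)[-1].rsplit('.', 1)[0]
    return stem.split('_')[-1]

label = tess_emotion('Emotion_Data/Datasets/TESS/OAF_Fear/OAF_back_fear.wav')
# -> 'fear'
```

This makes it easy to build the path/emotion dataframe described below without any external annotation file.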

| Fig [4.2.c] | Fig [4.2.d] |

| Fig [4.2.e] the data is loaded into a dataframe. The dataframe contains essential information like the path and emotion |
The Mel-spectrogram for the TESS files are saved in the directory :
_Emotion_Data/Datasets/TESSMEL/
Example of an audio file:

| Fig [4.2.f] An extracted Mel-spectrogram from TESS dataset stored as an image |
The IEMOCAP dataset consists of 151 recorded dialogue videos, with 2 speakers per session, for a total of 302 videos across the dataset. Each segment is annotated for the presence of up to 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed, and neutral). In addition, the dataset contains other essential emotion markers such as valence, arousal, and dominance. The dataset was recorded across 5 sessions with 5 pairs of speakers.
Some studies on emotion recognition consider emotions on a 3-dimensional scale in which valence, arousal, and dominance are emphasised. However, the experiments conducted here treat each emotion as a discrete value. The modality used from this database is audio (16-bit, 48 kHz .wav).
In the experiments, both the audio speech and the transcripts are used to test the performance of the model, as the aim is to acquire emotions either individually or through fusion (audio and text).

| Fig [4.3.a] | Fig [4.3.b] |
Figures [4.3.a]–[4.3.b] show the distribution of the IEMOCAP data. Fig [4.3.a] shows the distribution of emotions in the IEMOCAP data before removing the under-represented emotions. There are 9 emotions in total, and it can be clearly seen that they are not balanced: there are 1600+ data points for neutral, 1000+ data points each for angry, sad, and excited, while happy has around 100. Training a deep learning model on this imbalance would bias the model towards the emotion with the most data points (e.g., neutral). Therefore, only the following emotions are used for training: happy and excited, neutral, angry, and sad; the remaining emotions are marked as "others", as seen in Fig [xx]. Fig [4.3.b] shows the resulting distribution of the emotions used in the experiments.

| Fig [4.3.c] | Fig [4.3.d] |
Fig [4.3.c] shows the distribution of emotion intensity; there are fewer strong-intensity samples in this dataset. Fig [4.3.d] shows the distribution of statements, which is balanced.
The IEMOCAP dataset consists of videos, audio, and transcripts containing either scripted or improvised utterances. The video data, consisting of facial landmarks, is not used in the experiments.
The dataset is spread out over sessions (Session 1 – Session 5), each containing the transcripts and the audio of entire conversations. The pre-processing scripts segment each full recording into per-speaker utterances. This is done using the transcripts, which contain speaker information such as start time, stop time, dialogue, and emotion.
The file naming conventions used for pre-processing the data are referenced in [xx]. Fig [xx] below shows an example of the audio before and after pre-processing. The pre-processing scripts, located in the DIR, segment the whole utterance in the DIR into sentences.
| Fig [4.3.e] |
| Fig [4.3.f] |
audio_loc_utt= 'Emotion_Data/IEMOCAP_full_release/Session5/dialog/wav/Ses05F_impro04.wav'
load_audio(audio_loc_utt,SR=8000)
audio_loc_sent= 'Emotion_Data/IEMOCAP_full_release/Session5/sentences/wav/Ses05F_impro04/Ses05F_impro04_F003.wav'
load_audio(audio_loc_sent,SR=8000)
The IEMOCAP dataset consists of videos, audio, and transcripts containing either scripted or improvised utterances. The transcripts contain essential speaker information such as start time, stop time, dialogue, and emotion.
The dialogue is sent to the text model, and the start and stop times are used to segment the audio.
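This segmentation step amounts to slicing the waveform at the transcript's timestamps. A minimal sketch, assuming the dialogue audio is already loaded as a NumPy array and the transcript supplies times in seconds (function and variable names are hypothetical):

```python
import numpy as np

def segment_utterance(audio, sr, start_s, stop_s):
    """Cut one speaker turn out of a full-dialogue waveform using the
    start/stop times (in seconds) given in the transcript."""
    lo, hi = int(start_s * sr), int(stop_s * sr)
    return audio[lo:hi]

sr = 8000
dialogue = np.zeros(sr * 60)   # a (silent) 60 s stand-in for a dialogue
turn = segment_utterance(dialogue, sr, start_s=12.5, stop_s=14.0)
# -> 1.5 s of audio, i.e. 12000 samples
```

Each extracted turn is then paired with its dialogue text and emotion label from the same transcript row.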

| Fig [4.3.g] |
Fig [4.3.g] shows the information included in the transcripts for the sample utterance Ses05_impo04.
The scripts referenced below search through the transcripts directory and gather information on the number of files available. They then generate the files transcripts.csv and processed_tran.csv (DIR + IEMOCAP/Audio/processed_tran.csv).
_Shreyah_code/IEMOCAP/Audio/Preprocessingscript.ipynb. Running the script produces the files processed_tran.csv and processed_label.txt, which contain the SessionID and dialogue, and the labels, respectively.
| Fig [4.3.h] |
Similarly, by summing the number of files in all of the sessions from the directory /Emoevaluation, we can confirm that the overall number of main utterances matches the transcripts.
The sentences directory contains the segmented utterances; summing them yields the totals for all sessions, and the dialogues for each session match the transcripts, as seen below. After summing all the sentences for all sessions we have around 10000 sentences.
| Fig [4.3.i] |
The processed_tran.csv and processed_label.txt files are then merged into the dataframe shown below. The fields contain the Session_ID (used to access the path to the segmented wav file), the emotion, and the text.
| Fig [4.3.j] |
The text pre-processing involved the same steps of merging processed_tran.csv and processed_label.txt into a dataframe. These files are located in DIR :
(Crowd-sourced Emotional Multimodal Actors Dataset)
License: This Crowd-sourced Emotional Mutimodal Actors Dataset (CREMA-D) is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/
CREMA-D is a data set of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74, coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).
Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral and Sad) and four different emotion levels (Low, Medium, High and Unspecified).
Participants rated the emotion and emotion levels based on the combined audiovisual presentation, the video alone, and the audio alone. Due to the large number of ratings needed, this effort was crowd-sourced and a total of 2443 participants each rated 90 unique clips, 30 audio, 30 visual, and 30 audio-visual. 95% of the clips have more than 7 ratings.
The description below specifies the data made available in this repository.
For a more complete description of how CREMA-D was created use this link or the link below to the paper.
Audio Files: MP3 audio files used for presentation to the raters are stored in the AudioMP3 directory.
Processed Audio: WAV audio files converted from the original video into a format appropriate for computational audio processing are stored in the AudioWAV directory.
In the experiments, both the audio speech and the transcripts are used to test the performance of the model, as the aim is to acquire emotions either individually or through fusion (audio and text).

| Fig [4.3.b] |
Fig [4.3.b] depicts the distribution of emotions in the CREMA-D data. There are 6 emotions in total, and almost all are balanced, with roughly 1200 stimuli for each emotion.
audio_loc= 'Emotion_Data/cremad/AudioWAV/1091_WSI_NEU_XX.wav'
load_audio(audio_loc,SR=8000)
The file naming convention for CREMA-D is as follows. The actor ID is a 4-digit number at the start of the filename; each subsequent identifier is separated by an underscore (_).
Actors spoke from a selection of 12 sentences (in parentheses is the three letter acronym used in the second part of the filename):
The sentences were presented using different emotion (in parentheses is the three letter code used in the third part of the filename):
Sad (SAD)
Emotion level (in parentheses is the two letter code used in the fourth part of the filename):
Low (LO)
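Based on the convention above (4-digit actor ID, then sentence, emotion, and level codes separated by underscores), a small hypothetical parser can split a filename into its four fields:

```python
def parse_crema(filename):
    """Split a CREMA-D name like 1091_WSI_NEU_XX.wav into its four
    underscore-separated fields: actor ID, sentence code, emotion
    code, and emotion-level code."""
    stem = filename.rsplit('/', 1)[-1].rsplit('.', 1)[0]
    actor, sentence, emotion, level = stem.split('_')
    return int(actor), sentence, emotion, level

info = parse_crema('Emotion_Data/cremad/AudioWAV/1091_WSI_NEU_XX.wav')
# -> (1091, 'WSI', 'NEU', 'XX')
```

Mapping the three-letter emotion code and two-letter level code to labels then only requires the code tables listed above.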

| Fig [4.4.a] |
The extracted Mel-spectrograms and windows are in the directory
DIR : Emotion_Data/cremad/split/
Apart from the audio files directory "SER/Emotion_Data/cremad/AudioWAV/", there is another folder containing the pre-processed spectrograms as images in Emotion_Data/cremad/split/. The script divides them into train, test, and validation splits and saves the images in the respective directories.
Examples of the pre-saved Mel-spectrograms and windowed Mel-spectrograms are shown below:

The CREMA-D dataset is well suited to testing the efficacy of the model: it consists of videos and audio for a much wider range of participants (91 people) with various accents, and because the ratings are crowd-sourced, the data is prone to external noise. However, there is the hypothesis that real-life emotions are not similar to acted emotions.
The sentences are mostly scripted; the same phrases are spoken with different emotions, so the text extracted in this case is not useful.
(CMU Multimodal Opinion Sentiment and Emotion Intensity)
License: Copyright 2018 The CMU-MultimodalSDK Contributers
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. FURTHERMORE, IN NO EVENT SHALL THE CREATORS OF THE SOFTWARE BE RESPONSIBLE FOR CONSUMER ABUSE (COPYRIGHT VIOLATION, SUBJECT COMPLAINTS) OF DATASETS BOTH STANDARD AND NON-STANDARD DATASETS. THE DATASETS AND MODELS ARE PROVIDED AS IS. USERS ARE RESPONSIBLE FOR PROPER USAGE OF DATASETS AND MODELS INCLUDING RIGHTS OF SUBJECTS WITHIN THE VIDEOS AS WELL AS PROPER CONDUCT WHEN IT COMES TO PATENTED MODELS, DATASETS OR SCIENTIFIC FINDINGS (THROUGH CITING PATENTS OR SCIENTIFIC PAPERS).
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is the largest dataset of sentence level sentiment analysis and emotion recognition in online videos. CMU-MOSEI contains more than 65 hours of annotated video from more than 1000 speakers and 250 topics. Each video segment contains manual transcription aligned with audio to phoneme level.
All the videos are gathered from online video sharing websites.
The dataset is currently part of the CMU Multimodal Data SDK and is freely available to the scientific community through GitHub. The dataset was introduced at the 2018 meeting of the Association for Computational Linguistics and used in the co-located First Grand Challenge and Workshop on Human Multimodal Language.
_Shreyah_code/MOSEI/Preprocessing/data/Downloaddata.ipynb
The script downloads the data and stores them in the directory : _Shreyah_code/MOSEI/Preprocessing/Preprocesseddata
The CMU-Multimodal SDK provides tools to easily load well-known multimodal datasets and rapidly build multimodal neural models. The SDK comprises two modules: 1) mmdatasdk, a module for downloading and processing multimodal datasets using computational sequences; 2) mmmodelsdk, tools to utilize complex neural models as well as layers for building new models. The fusion models in prior papers will be released here.
All the datasets here are processed using the SDK (even the old_processed_data folder which uses SDK V0).
The following link contains the word-aligned data and models used to run the experiments (without any new alignment techniques). The old pre-processed datasets used in the original papers are available at:
Data: http://immortal.multicomp.cs.cmu.edu/raw_datasets/processed_data/
A single audio clip can exhibit multiple emotions simultaneously, with up to 6 emotions (happy, angry, sad, disgust, fear, and surprise) co-existing. However, the emotions are heavily imbalanced in the data, biasing the model towards predicting the majority emotion classes.
| Fig [4.5.2.b] |
| Fig 4.5.2.c] |
Fig [4.5.2.b] depicts the distribution of emotions. The graph shows that multiple emotions co-exist, i.e., a single audio clip can exhibit several of the 6 emotions simultaneously.
Fig [4.5.2.c] shows the table containing the CMU data and its associated emotions after preprocessing. The file -3g5yACwYnA[2], shown in dataframe form, exhibits the three emotions happy, sad, and anger at the same time.
The audio playback of the file -3g5yACwYnA[2] is referenced in the cell below:
audio_loc= 'Emotion_Data/CMU_MOSEI/Raw/Audio/Segmented/train/-3g5yACwYnA[2].wav'
load_audio(audio_loc,SR=8000)
CMU-MOSEI effectively has around 65 hours of audio data. However, the challenges with the dataset are the annotation format, in which multiple emotions can exist at the same time, and the imbalance of the emotions. The emotion distribution is shown in the image below:

| Fig [5.b] |
Segmentation Script:
_Shreyah_code/MOSEI/Preprocessing/Scripts/Generate_audiovec-Part2.ipynb
Visualization and data examination script: _Shreyahcode/MOSEI/Preprocessing/Scripts/Preprocessing.ipynb
Pre-processed directory : _Shreyahcode/MOSEI/Preprocessing/data
The CMU-MOSEI dataset consists of videos, audio, and transcripts containing either scripted or improvised utterances. In the experiments conducted, the video data is not used.
The transcripts contain the essential information on each speaker's start and stop times across the entire conversation, along with the dialogue and emotion. The pre-processing scripts segment the full audio based on the start and stop times given in the transcript file.
Fig [xx] below shows an example: in IEMOCAP, Ses05F_impro04.wav refers to improvised utterance 04 of Session 5. The whole utterance in the DIR is segmented with the pre-processing scripts located in the DIR.
audio_loc='Emotion_Data/CMU_MOSEI/Raw/Audio/Segmented/test/zvZd3V5D5Ik[1].wav'
load_audio(audio_loc,SR=8000)
Annotation of CMU-MOSEI follows closely the annotation of CMU-MOSI (Zadeh et al., 2016a) and Stanford Sentiment Treebank (Socher et al., 2013).
Each sentence is annotated for sentiment on a [-3,3] Likert scale of: [−3: highly negative, −2: negative, −1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive]. Ekman emotions (Ekman et al., 1980) of {happiness, sadness, anger, fear, disgust, surprise} are annotated on a [0,3] Likert scale for presence of emotion x: [0: no evidence of x, 1: weakly x, 2: x, 3: highly x].
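As an illustration, the per-emotion [0,3] presence ratings can be collapsed into a multi-label 0/1 vector for classification; the threshold of 1 ("any evidence of x") is an assumed binarisation choice, not part of the annotation scheme:

```python
EKMAN = ['happiness', 'sadness', 'anger', 'fear', 'disgust', 'surprise']

def emotion_vector(scores, threshold=1):
    """Turn per-emotion [0,3] presence ratings into a multi-label
    0/1 vector; 'threshold' (score >= 1, i.e. any evidence) is an
    illustrative choice, not prescribed by the annotation scheme."""
    return [1 if scores.get(e, 0) >= threshold else 0 for e in EKMAN]

vec = emotion_vector({'happiness': 2, 'anger': 1})
# -> [1, 0, 1, 0, 0, 0]
```

A clip with several non-zero ratings yields several 1s, which is exactly the multi-label co-existence discussed above.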
The annotation was carried out by 3 crowdsourced judges from the Amazon Mechanical Turk platform. To avoid implicitly biasing the judges and to capture the raw perception of the crowd, extreme annotation training was avoided; instead, the judges were given a 5-minute training video on how to use the annotation system. All annotations were carried out only by master workers with an approval rate higher than 98% to assure high-quality annotations.
Figure 2 shows the distribution of sentiment and emotions in the CMU-MOSEI dataset. The distribution shows a slight shift in favor of positive sentiment, similar to the distributions of CMU-MOSI and SST.
This reflects an implicit bias of online opinions towards the positive, which is also present in CMU-MOSI. The emotion histogram shows different prevalences for different emotions.
The most common category is happiness, with more than 12,000 positive sample points. The least prevalent emotion is fear, with almost 1900 positive sample points, which is still an acceptable number for machine learning studies.
The extracted Mel-spectrograms and windows are in the directory
DIR : Emotion_Data/CMU_MOSEI/Specterogram/Windows/
As the audio files can range from 30 seconds to a couple of minutes, they have to be windowed. The windows were taken with a hop length of 2 seconds and a window frame size of 4 seconds.
Apart from the audio files directory "Emotion_Data/CMU_MOSEI/Raw/Audio/Full", there is another folder containing the pre-processed spectrogram windows in the directory Emotion_Data/CMU_MOSEI/Spectrograms/Windows/. The script divides them into train, test, and validation splits and saves the images in the respective directories.
Examples of the pre-saved Mel-spectrograms and windowed Mel-spectrograms are shown below:

All videos have manual transcription. Glove word embeddings (Pennington et al., 2014) were used to extract word vectors from transcripts. Words and audio are aligned at phoneme level using P2FA forced alignment model (Yuan and Liberman, 2008). Following this, acoustic modalities are aligned to the words by interpolation. Since the utterance duration of words in English is usually short, this interpolation does not lead to substantial information loss.
Scripts to correlate audio length with the texts from the transcripts: here we find the word-length distribution for the audio files (.wav) in _Emotion_Data/CMU_MOSEI/Raw/Audio/Full/WAV16000/ per segment and check whether the text and the words spoken correlate. _Shreyah_code/MOSEI/Preprocessing/Scripts/audio_textcorrelation.ipynb
The following deep learning models were explored with the datasets mentioned, and their results are presented below:
Datasets :
RAVDESS :
TESS :
IEMOCAP:
CREMA-D :
Early-stopping : Shreyah_code/Pretrained/src/CREMA/CREMA-resnet50.ipynb

Classification report
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| happy | 0.67 | 0.68 | 0.67 | 140 |
| disgust | 0.55 | 0.51 | 0.53 | 140 |
| neutral | 0.56 | 0.44 | 0.49 | 140 |
| sad | 0.57 | 0.53 | 0.55 | 140 |
| fear | 0.78 | 0.47 | 0.58 | 120 |
| angry | 0.42 | 0.72 | 0.53 | 140 |
| accuracy | | | 0.56 | 820 |
| macro avg | 0.59 | 0.56 | 0.56 | 820 |
| weighted avg | 0.59 | 0.56 | 0.56 | 820 |

Classification report
precision recall f1-score support
happy 0.67 0.80 0.73 140
disgust 0.56 0.68 0.61 140
neutral 0.62 0.64 0.63 140
sad 0.67 0.56 0.61 140
fear 0.79 0.77 0.78 120
angry 0.59 0.45 0.51 140
accuracy 0.65 820
macro avg 0.65 0.65 0.65 820
weighted avg 0.65 0.65 0.64 820
RAVDESS :

TESS:
With early stopping, for 8 emotions (FastAi code): Shreyah_code/Pretrained/src/TESS/TESS-RESNET50.ipynb
Classification report
precision recall f1-score support
happy 0.67 0.80 0.73 140
disgust 0.56 0.68 0.61 140
neutral 0.62 0.64 0.63 140
sad 0.67 0.56 0.61 140
fear 0.79 0.77 0.78 120
angry 0.59 0.45 0.51 140
accuracy 0.65 820
macro avg 0.65 0.65 0.65 820
weighted avg 0.65 0.65 0.64 820
IEMOCAP:
With early stopping, for 8 emotions (FastAi code): Shreyah_code/Pretrained/src/EMOCAP/IEMOCAP_resnet50.ipynb

Classification report
precision recall f1-score support
fru 0.51 0.51 0.51 170
exc 0.53 0.21 0.30 300
sad 0.35 0.42 0.38 381
neu 0.46 0.65 0.54 384
ang 0.68 0.51 0.58 245
accuracy 0.46 1480
macro avg 0.50 0.46 0.46 1480
weighted avg 0.49 0.46 0.45 1480
CREMA-D :
Early-stopping : Shreyah_code/Pretrained/src/CREMA/CREMA-6emo_DENSENET201.ipynb
Classification report
precision recall f1-score support
happy 0.67 0.68 0.67 140
disgust 0.55 0.51 0.53 140
neutral 0.56 0.44 0.49 140
sad 0.57 0.53 0.55 140
fear 0.78 0.47 0.58 120
angry 0.42 0.72 0.53 140
accuracy 0.56 820
macro avg 0.59 0.56 0.56 820
weighted avg 0.59 0.56 0.56 820

RAVDESS :
With early stopping and 3 emotions with FastAi:
Shreyah_code/Pretrained/src/RAVDESS/RAVDESS-Emotions-Densenet-3lab.ipynb

Classification report
precision recall f1-score support
angry 1.00 0.67 0.80 12
happy 0.92 0.96 0.94 24
neutral 0.85 0.96 0.90 24
accuracy 0.90 60
macro avg 0.92 0.86 0.88 60
weighted avg 0.91 0.90 0.90 60
With early stopping and 8 emotions with FastAi:
Shreyah_code/Pretrained/src/RAVDESS/RAVDESS-Emotions-Densenet.ipynb

Classification report
precision recall f1-score support
surprised 0.77 0.83 0.80 24
happy 0.68 0.54 0.60 24
fearful 0.52 0.46 0.49 24
disgust 0.78 0.88 0.82 24
neutral 0.68 0.79 0.73 24
sad 0.42 0.46 0.44 24
calm 0.50 0.58 0.54 12
angry 0.84 0.67 0.74 24
accuracy 0.66 180
macro avg 0.65 0.65 0.65 180
weighted avg 0.66 0.66 0.65 180
With windowing and post-processing, 8 emotions with FastAi: Shreyah_code/Pretrained/src/RAVDESS/RAVDESS_Emotions-DENSENET201_windowing-post_processing.ipynb
Classification report
precision recall f1-score support
surprised 0.75 0.56 0.64 16
happy 0.67 0.88 0.76 16
fearful 0.57 0.81 0.67 16
disgust 0.63 0.75 0.69 16
neutral 0.58 0.44 0.50 16
sad 0.00 0.00 0.00 8
calm 0.00 0.00 0.00 0
angry 0.45 0.31 0.37 16
none 0.85 0.69 0.76 16
accuracy 0.59 120
macro avg 0.50 0.49 0.49 120
weighted avg 0.60 0.59 0.58 120
Datasets :
CREMA-D :
With early stopping and 3 emotions (FastAi code) : Shreyah_code/Pretrained/src/CREMA/CREMA-Resnext-3lb.ipynb

Classification report
precision recall f1-score support
happy 0.76 0.91 0.82 140
neutral 0.85 0.74 0.79 140
angry 0.95 0.87 0.90 120
accuracy 0.84 400
macro avg 0.85 0.84 0.84 400
weighted avg 0.85 0.84 0.84 400

Classification report
precision recall f1-score support
happy 0.18 0.48 0.26 789
disgust 0.40 0.39 0.40 3452
neutral 0.34 0.18 0.23 2873
sad 0.59 0.31 0.41 2473
fear 0.38 0.44 0.41 2503
angry 0.37 0.46 0.41 3333
accuracy 0.37 15423
macro avg 0.37 0.38 0.35 15423
weighted avg 0.40 0.37 0.36 15423
RAVDESS :
With early stopping and 3 emotions (FastAi code): Shreyah_code/Pretrained/src/RAVDESS/RAVDESS-Emotions-RESNEXT-3lb.ipynb

Classification report
precision recall f1-score support
angry 1.00 0.67 0.80 12
happy 0.92 0.96 0.94 24
neutral 0.85 0.96 0.90 24
accuracy 0.90 60
macro avg 0.92 0.86 0.88 60
weighted avg 0.91 0.90 0.90 60
TESS:
With early stopping, windowing for 8 emotions (FastAi code): Shreyah_code/Pretrained/src/TESS/TESS-Windowing.ipynb
Classification report
precision recall f1-score support
happy 1.00 1.00 1.00 44
disgust 1.00 1.00 1.00 43
neutral 1.00 1.00 1.00 43
sad 0.98 0.98 0.98 42
ps 1.00 1.00 1.00 43
fear 1.00 0.98 0.99 42
angry 0.98 1.00 0.99 43
accuracy 0.99 300
macro avg 0.99 0.99 0.99 300
weighted avg 0.99 0.99 0.99 300
IEMOCAP :
With early stopping and 3 emotions (FastAi code):
Shreyah_code/Pretrained/src/EMOCAP/IEMOCAP_RESNEXT_3lb.ipynb

Classification report
precision recall f1-score support
NEU 0.69 0.79 0.74 170
HAP 0.47 0.14 0.22 143
ANG 0.71 0.85 0.78 384
accuracy 0.69 697
macro avg 0.62 0.60 0.58 697
weighted avg 0.66 0.69 0.65 697
With early stopping and 8 emotions (FastAi code):
Shreyah_code/Pretrained/src/EMOCAP/IEMOCAP_resnext.ipynb

Classification report
precision recall f1-score support
ang 0.54 0.47 0.50 170
exc 0.53 0.39 0.45 300
fru 0.47 0.50 0.49 381
neu 0.51 0.72 0.60 384
sad 0.70 0.47 0.57 245
accuracy 0.53 1480
macro avg 0.55 0.51 0.52 1480
weighted avg 0.54 0.53 0.52 1480
Datasets :
IEMOCAP:
With early stopping and 4 emotions (PyTorch code): Shreyah_code/IEMOCAP/Audio/Alexnet/Alexnet_audio_4class_mixed_train_val_balanced_emo-FN.ipynb

Classification report
precision recall f1-score support
angry 0.38 0.59 0.46 75
happy 0.58 0.33 0.42 216
sad 0.46 0.69 0.55 137
neutral 0.41 0.38 0.40 177
accuracy 0.46 605
macro avg 0.46 0.50 0.46 605
weighted avg 0.48 0.46 0.45 605
Convolutional Neural Network
Datasets :
1D CNN Model 1,2,3:
2D CNN Model 1:
2D CNN Model 2:
RAVDESS:
2D CNN Model 1:
With early stopping and 6 emotions with Model 1 (2 layers)(Keras code): Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label40_ft-2dcnn_model1.ipynb

Classification report for Set1
precision recall f1-score support
angry 0.41 0.31 0.35 64
disgust 0.48 0.69 0.57 64
fear 0.48 0.55 0.51 64
happy 0.30 0.27 0.28 64
neutral 0.39 0.47 0.43 32
sad 0.30 0.20 0.24 64
accuracy 0.41 352
macro avg 0.39 0.41 0.40 352
weighted avg 0.39 0.41 0.39 352
Classification report for Set2

precision recall f1-score support
angry 0.92 0.89 0.90 64
disgust 0.92 0.86 0.89 64
fear 0.84 0.92 0.88 64
happy 0.92 0.88 0.90 64
neutral 0.72 0.88 0.79 32
sad 0.92 0.86 0.89 64
accuracy 0.88 352
macro avg 0.87 0.88 0.87 352
weighted avg 0.89 0.88 0.88 352
Classification report for Set3

precision recall f1-score support
angry 0.84 0.84 0.84 64
disgust 0.85 0.94 0.89 64
fear 0.91 0.91 0.91 64
happy 0.79 0.72 0.75 64
neutral 0.81 0.91 0.85 32
sad 0.93 0.86 0.89 64
accuracy 0.86 352
macro avg 0.85 0.86 0.86 352
weighted avg 0.86 0.86 0.86 352
Classification report for Set4

precision recall f1-score support
angry 0.56 0.56 0.56 64
disgust 0.64 0.81 0.72 64
fear 0.77 0.72 0.74 64
happy 0.59 0.56 0.58 64
neutral 0.59 0.72 0.65 32
sad 0.68 0.50 0.58 64
accuracy 0.64 352
macro avg 0.64 0.65 0.64 352
weighted avg 0.64 0.64 0.64 352
Classification report for Set5

precision recall f1-score support
angry 0.62 0.52 0.56 64
disgust 0.80 0.73 0.76 64
fear 0.59 0.84 0.70 64
happy 0.50 0.50 0.50 64
neutral 0.66 0.66 0.66 32
sad 0.72 0.59 0.65 64
accuracy 0.64 352
macro avg 0.65 0.64 0.64 352
weighted avg 0.65 0.64 0.64 352
Classification report for Set6

precision recall f1-score support
angry 0.71 0.64 0.67 64
disgust 0.55 0.83 0.66 64
fear 0.80 0.61 0.69 64
happy 0.65 0.53 0.59 64
neutral 0.48 0.75 0.59 32
sad 0.64 0.47 0.54 64
accuracy 0.63 352
macro avg 0.64 0.64 0.62 352
weighted avg 0.65 0.63 0.63 352
2D CNN Model 2:
With early stopping and 6 emotions with Model 2 (6 layers) (Keras code):
Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label40_ft-2dcnn_model2.ipynb

Classification report for Set1
precision recall f1-score support
angry 0.41 0.31 0.35 64
disgust 0.48 0.69 0.57 64
fear 0.48 0.55 0.51 64
happy 0.30 0.27 0.28 64
neutral 0.39 0.47 0.43 32
sad 0.30 0.20 0.24 64
accuracy 0.41 352
macro avg 0.39 0.41 0.40 352
weighted avg 0.39 0.41 0.39 352

Classification report for Set2
precision recall f1-score support
angry 0.92 0.89 0.90 64
disgust 0.92 0.86 0.89 64
fear 0.84 0.92 0.88 64
happy 0.92 0.88 0.90 64
neutral 0.72 0.88 0.79 32
sad 0.92 0.86 0.89 64
accuracy 0.88 352
macro avg 0.87 0.88 0.87 352
weighted avg 0.89 0.88 0.88 352

Classification report for Set3
precision recall f1-score support
angry 0.84 0.84 0.84 64
disgust 0.85 0.94 0.89 64
fear 0.91 0.91 0.91 64
happy 0.79 0.72 0.75 64
neutral 0.81 0.91 0.85 32
sad 0.93 0.86 0.89 64
accuracy 0.86 352
macro avg 0.85 0.86 0.86 352
weighted avg 0.86 0.86 0.86 352

Classification report for Set4
precision recall f1-score support
angry 0.56 0.56 0.56 64
disgust 0.64 0.81 0.72 64
fear 0.77 0.72 0.74 64
happy 0.59 0.56 0.58 64
neutral 0.59 0.72 0.65 32
sad 0.68 0.50 0.58 64
accuracy 0.64 352
macro avg 0.64 0.65 0.64 352
weighted avg 0.64 0.64 0.64 352

Classification report for Set5
precision recall f1-score support
angry 0.62 0.52 0.56 64
disgust 0.80 0.73 0.76 64
fear 0.59 0.84 0.70 64
happy 0.50 0.50 0.50 64
neutral 0.66 0.66 0.66 32
sad 0.72 0.59 0.65 64
accuracy 0.64 352
macro avg 0.65 0.64 0.64 352
weighted avg 0.65 0.64 0.64 352

Classification report for Set6
precision recall f1-score support
angry 0.71 0.64 0.67 64
disgust 0.55 0.83 0.66 64
fear 0.80 0.61 0.69 64
happy 0.65 0.53 0.59 64
neutral 0.48 0.75 0.59 32
sad 0.64 0.47 0.54 64
accuracy 0.63 352
macro avg 0.64 0.64 0.62 352
weighted avg 0.65 0.63 0.63 352
RAVDESS:
1D CNN Model 1:
With early stopping and 6 emotions, 1D CNN models (3 layers, 5 layers, 8 layers) (Keras code):
Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label40_mod_ft_1d%2C358.ipynb

Model 1 Set 1
precision recall f1-score support
angry 0.90 0.94 0.92 64
disgust 0.90 0.86 0.88 64
fear 0.93 0.80 0.86 64
happy 0.87 0.95 0.91 64
neutral 0.85 0.91 0.88 32
sad 0.86 0.88 0.87 64
accuracy 0.89 352
macro avg 0.89 0.89 0.89 352
weighted avg 0.89 0.89 0.89 352

Model 2 Set 1
precision recall f1-score support
angry 0.91 0.97 0.94 64
disgust 0.88 0.89 0.88 64
fear 0.96 0.80 0.87 64
happy 0.89 0.91 0.90 64
neutral 0.90 0.88 0.89 32
sad 0.87 0.95 0.91 64
accuracy 0.90 352
macro avg 0.90 0.90 0.90 352
weighted avg 0.90 0.90 0.90 352

Model 3 Set 1
precision recall f1-score support
angry 0.87 0.94 0.90 64
disgust 0.88 0.83 0.85 64
fear 0.98 0.62 0.76 64
happy 0.71 0.92 0.80 64
neutral 0.49 0.97 0.65 32
sad 0.94 0.53 0.68 64
accuracy 0.79 352
macro avg 0.81 0.80 0.78 352
weighted avg 0.84 0.79 0.79 352
RAVDESS:
1D CNN Model 1:
With early stopping and 6 emotions, 1D CNN models (3 layers, 5 layers, 8 layers) (Keras code):
Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label_time_13-Copy1.ipynb

Model 1 Set 1
precision recall f1-score support
angry 0.57 0.52 0.54 64
disgust 0.64 0.55 0.59 64
fear 0.64 0.56 0.60 64
happy 0.43 0.58 0.49 64
neutral 0.36 0.50 0.42 32
sad 0.65 0.53 0.59 64
accuracy 0.54 352
macro avg 0.55 0.54 0.54 352
weighted avg 0.57 0.54 0.55 352

Model 2 Set 1
precision recall f1-score support
angry 0.53 0.52 0.52 64
disgust 0.51 0.66 0.58 64
fear 0.51 0.48 0.50 64
happy 0.43 0.50 0.46 64
neutral 0.48 0.34 0.40 32
sad 0.45 0.34 0.39 64
accuracy 0.49 352
macro avg 0.48 0.47 0.47 352
weighted avg 0.48 0.49 0.48 352

Model 3 Set 1
precision recall f1-score support
angry 0.48 0.50 0.49 64
disgust 0.54 0.66 0.59 64
fear 0.67 0.45 0.54 64
happy 0.56 0.50 0.53 64
neutral 0.52 0.47 0.49 32
sad 0.54 0.66 0.59 64
accuracy 0.55 352
macro avg 0.55 0.54 0.54 352
weighted avg 0.55 0.55 0.54 352
RAVDESS:
1D CNN Model 1:
1D CNN(3 layers, 5 layers, 8 layers)(Keras code):
Shreyah_code/SER_CNN/Ravdess_actor_split/Ravdess_cv_6label13_ft_1d.ipynb

Model 1 Set 1
precision recall f1-score support
angry 0.92 0.92 0.92 64
disgust 0.88 0.91 0.89 64
fear 0.90 0.81 0.85 64
happy 0.84 0.88 0.85 64
neutral 0.87 0.81 0.84 32
sad 0.87 0.91 0.89 64
accuracy 0.88 352
macro avg 0.88 0.87 0.87 352
weighted avg 0.88 0.88 0.88 352

Model 2 Set 1
precision recall f1-score support
angry 0.86 0.94 0.90 64
disgust 0.91 0.83 0.87 64
fear 0.92 0.89 0.90 64
happy 0.92 0.95 0.94 64
neutral 0.88 0.88 0.88 32
sad 0.94 0.94 0.94 64
accuracy 0.91 352
macro avg 0.90 0.90 0.90 352
weighted avg 0.91 0.91 0.91 352

Model 3 Set 1
precision recall f1-score support
angry 0.94 0.92 0.93 64
disgust 0.90 0.88 0.89 64
fear 0.92 0.88 0.90 64
happy 0.84 0.98 0.91 64
neutral 0.97 0.91 0.94 32
sad 0.98 0.94 0.96 64
accuracy 0.92 352
macro avg 0.92 0.92 0.92 352
weighted avg 0.92 0.92 0.92 352
Datasets :
IEMOCAP:
With early stopping and 4 emotions (PyTorch code): Shreyah_code/IEMOCAP/Attention/Attention_audio_4class_mixed_train_val_balanced_emo.ipynb

Classification report
precision recall f1-score support
angry 0.32 0.53 0.40 92
happy and excited 0.44 0.32 0.37 206
sad 0.47 0.60 0.53 107
neutral 0.47 0.39 0.42 192
accuracy 0.42 597
macro avg 0.42 0.46 0.43 597
weighted avg 0.44 0.42 0.42 597
IEMOCAP:
With early stopping and 3 emotions (Angry, Sad, Neutral) (PyTorch code): Shreyah_code/IEMOCAP/Attention/Attention_audio-added-angry%2Csad%2Cneutral-original.ipynb

Classification report
precision recall f1-score support
angry 0.00 0.00 0.00 122
sad 0.00 0.00 0.00 111
neutral 0.40 1.00 0.57 157
accuracy 0.40 390
macro avg 0.13 0.33 0.19 390
weighted avg 0.16 0.40 0.23 390
With early stopping and 3 emotions (Angry, Sad, Happy) (PyTorch code): Shreyah_code/IEMOCAP/Attention/Attention_audio-added-angry%2Csad%2Cneutral-original.ipynb

Classification report
precision recall f1-score support
angry 0.00 0.00 0.00 122
sad 0.00 0.00 0.00 111
neutral 0.40 1.00 0.57 157
accuracy 0.40 390
macro avg 0.13 0.33 0.19 390
weighted avg 0.16 0.40 0.23 390
IEMOCAP:
Attention + CNN + LSTM for 4 emotions (PyTorch code): Shreyah_code/IEMOCAP/Audio/CNN%2BLSTM/CNN_Lstm_attn_audio-%20a%2Cs%2Ch%2Cn_test-alldata_mixed.ipynb

Classification report
precision recall f1-score support
angry 0.48 0.41 0.44 78
happy 0.25 0.63 0.36 236
sad 0.23 0.72 0.35 138
neutral 0.87 0.46 0.60 1324
accuracy 0.50 1776
macro avg 0.46 0.55 0.44 1776
weighted avg 0.72 0.50 0.54 1776
Attention+Alexnet+LSTM for 4 emotions (PyTorch code): Shreyah_code/IEMOCAP/Audio/CNN%2BLSTM/ALEXNET_Lstm_ATTN_audio_mixed_train_val_balanced_emo-%20a%2Cs%2Ch%2Cn_test.ipynb

Classification report
precision recall f1-score support
angry 0.15 0.76 0.26 92
happy 0.12 0.37 0.18 206
sad 0.17 0.01 0.02 107
neutral 0.83 0.40 0.54 1324
accuracy 0.39 1729
macro avg 0.32 0.38 0.25 1729
weighted avg 0.67 0.39 0.45 1729
Similar to Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from hundreds of thousands of hours of unlabeled speech in more than 50 languages. Similar to BERT's masked language modeling, the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

The authors show for the first time that massively pretraining an ASR model on cross-lingual unlabeled speech data, followed by language-specific fine-tuning on very little labeled data, achieves state-of-the-art results. See Tables 1-5 of the official paper.
In order to feed the audio into our classification model, we need to set up the relevant Wav2Vec2 assets for our language. Since most of the data available is English, I am using the following fine-tuned model: jonatasgrosman/wav2vec2-large-xlsr-53-english, fine-tuned by jonatasgrosman. To handle the context representations for any audio length, we use a merge strategy (pooling mode) to collapse the 3D representations into 2D representations.
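A minimal sketch of such a pooling strategy, assuming NumPy arrays of shape (batch, time, hidden); the hidden size of 1024 matches wav2vec2-large but is otherwise illustrative:

```python
import numpy as np

def merge_hidden_states(hidden_states, mode="mean"):
    """Collapse (batch, time, hidden) context representations to (batch, hidden)."""
    if mode == "mean":
        return hidden_states.mean(axis=1)
    if mode == "max":
        return hidden_states.max(axis=1)
    if mode == "sum":
        return hidden_states.sum(axis=1)
    raise ValueError(f"unknown pooling mode: {mode}")

# Two utterances of 50 frames each, 1024-dim features -> two 1024-dim vectors.
batch = np.random.randn(2, 50, 1024)
pooled = merge_hidden_states(batch, mode="mean")  # shape (2, 1024)
```

Because the time axis is averaged away, the classifier sees a fixed-size vector regardless of the utterance length.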
So far, we have downloaded, loaded, and split the SER dataset into train and test sets, and instantiated our strategy configuration for using context representations in our SER classification problem. Now we need to extract features from the audio paths as context-representation tensors and feed them into our classification model to determine the emotion in the speech.
Since the audio files are saved in .wav format, it is easy to use Librosa or alternatives such as Torchaudio.
An audio file usually stores both its sample values and the sampling rate with which the speech signal was digitized. We want to store both in the dataset and write a map(...) function accordingly. We also need to convert the string labels into integers for our specific task, in this case single-label classification; you may want to adapt this for regression or even multi-label classification.
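A sketch of that map(...) step. The real notebooks use Librosa/Torchaudio and the Hugging Face datasets API; here the stdlib wave module is used so the example is self-contained, and the label2id mapping is illustrative:

```python
import wave, struct

label2id = {"angry": 0, "happy": 1, "neutral": 2, "sad": 3}  # illustrative label set

def load_example(path, label):
    """Mimic the map(...) step: read samples + sampling rate, encode the label."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        n = wav.getnframes()
        samples = struct.unpack(f"<{n}h", wav.readframes(n))
    return {"speech": list(samples), "sampling_rate": rate, "label": label2id[label]}

# Write a tiny 16 kHz mono file so the sketch is runnable end to end.
with wave.open("tone.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<4h", 0, 100, -100, 0))

example = load_example("tone.wav", "happy")
```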
The wandb Python library is used to track the Wav2Vec2 experiments. It can be integrated with frameworks like PyTorch or Keras. https://wandb.ai/shreyah/EMOCAP/reports/Shared-panel-21-12-10-08-12-42--VmlldzoxMzI1MTY3
from IPython.display import IFrame
IFrame("https://wandb.ai/shreyah/EMOCAP/reports/IEMOCAP-results-with-Wav2Vec2-for-10-sets--VmlldzoxMzI1MTY3", width=800, height=650)
Datasets on Wav2vec2
IEMOCAP Scripts:
Wav2vec2 finetuning for 4 emotions (PyTorch code) Test Set 4: Shreyah_code/IEMOCAP/Audio/CNN_Transformers/ASHN_WAV2vec-Graph-set4.ipynb
Wav2vec2 finetuning for 4 emotions (PyTorch code) Test Set 5: Shreyah_code/IEMOCAP/Audio/CNN_Transformers/ASHN_WAV2vec-Graph-set5.ipynb
Wav2vec2 finetuning for 4 emotions (PyTorch code) Test Set 7: Shreyah_code/IEMOCAP/Audio/CNN_Transformers/ASHN_WAV2vec-Graph-set7.ipynb
Wav2vec2 finetuning for 4 emotions (PyTorch code) Test Set 8: Shreyah_code/IEMOCAP/Audio/CNN_Transformers/ASHN_WAV2vec-Graph-set8.ipynb
Wav2vec2 finetuning for 4 emotions (PyTorch code) Test Set 9: Shreyah_code/IEMOCAP/Audio/CNN_Transformers/ASHN_WAV2vec-Graph-set9.ipynb
Wav2vec2 finetuning for 4 emotions (PyTorch code) Test Set 10: Shreyah_code/IEMOCAP/Audio/CNN_Transformers/ASHN_WAV2vec-Graph-set10.ipynb
Saved CSV files for the train, test, and validation splits are in the folder: Emotion_Data/IEMOCAP_full_release/splits/
The results obtained from Wav2Vec2 are a clear improvement over the previous models such as CNN, AlexNet, and AlexNet + Attention.
The scripts for cross validation for the other test sets can be found below:

Classification report
precision recall f1-score support
angry 0.71 0.83 0.76 78
happy 0.84 0.63 0.72 236
neutral 0.59 0.81 0.69 192
sad 0.75 0.60 0.67 138
accuracy 0.70 644
macro avg 0.72 0.72 0.71 644
weighted avg 0.73 0.70 0.70 644
The scripts for training the Greek emotion data on Wav2Vec2 can be seen below:
Wav2vec2 finetuning for 5 emotions (PyTorch code): Shreyah_code/IEMOCAP/Audio/CNN_Transformers/Emotion%20recognition%20in%20Greek%20speech%20using%20Wav2Vec2.ipynb
Classification report
precision recall f1-score support
anger 1.00 1.00 1.00 24
disgust 1.00 1.00 1.00 24
fear 1.00 0.96 0.98 24
happiness 0.96 1.00 0.98 24
sadness 1.00 1.00 1.00 25
accuracy 0.99 121
macro avg 0.99 0.99 0.99 121
weighted avg 0.99 0.99 0.99 121
Datasets on Wav2vec2
RAVDESS Scripts:
Wav2vec2 finetuning for 8 emotions (PyTorch code)Test Set 1: Shreyah_code/RAVDESS/Wav2Vec2/src/Ravdess_WAV2vec_set1.ipynb
Wav2vec2 finetuning for 4 emotions (PyTorch code) Test Set 2: Shreyah_code/RAVDESS/Wav2Vec2/src/Ravdess_WAV2vec_set2.ipynb
Wav2vec2 finetuning for 4 emotions (PyTorch code) Test Set 3: Shreyah_code/RAVDESS/Wav2Vec2/src/Ravdess_WAV2vec_set3.ipynb
Saved CSV files for the train, test, and validation splits of RAVDESS are in the folder: Emotion_Data/Datasets/RAVDESS/Preprocessing
The scripts for training the RAVDESS emotion data on Wav2Vec2 can be seen below:
Results for Wav2vec2 finetuning for 8 emotions on RAVDESS dataset(PyTorch code) on Set1 test data is: Shreyah_code/IEMOCAP/Audio/CNN_Transformers/Ravdess_WAV2vec_set1.ipynb

Classification report
precision recall f1-score support
angry 0.84 0.81 0.83 32
calm 0.81 0.69 0.75 32
disgust 0.91 0.97 0.94 32
fear 0.72 0.97 0.83 32
happy 0.62 0.56 0.59 32
neutral 0.83 0.62 0.71 16
sad 0.62 0.50 0.55 32
surprise 0.76 0.91 0.83 32
accuracy 0.76 240
macro avg 0.76 0.75 0.75 240
weighted avg 0.76 0.76 0.76 240
F1: 0.7527456698170306
acc: 0.7625
Datasets on Wav2vec2 for CMU :
CMU Scripts:
Wav2vec2 finetuning for CMU data considering the emotions HAPPY, ANGRY, SAD, DISGUST, SURPRISE, FEAR and performing classification; only audios with a single emotion are considered, for 7 emotions (PyTorch code): Shreyah_code/MOSEI/Wav2vec/Audio/src/Training_Windowed-CMU_classification_7emotions.ipynb
Windowing and post-processing for the CMU data (HAPPY, ANGRY, SAD, NEUTRAL):
Shreyah_code/MOSEI/Wav2vec/Audio/src/Loaded_data_Windowed-CMU_classification_H_A_S_N-win_postproc.ipynb
The scripts for training the CMU REGRESSION emotion data on Wav2Vec2 can be seen below:
Wav2vec2 finetuning for CMU data considering multiple labels and performing Regression for 6 emotions (PyTorch code):Shreyah_code/MOSEI/Wav2vec/Audio/src/Regression_model-Windowed-4sec.ipynb
Next, the evaluation metric is defined. There are many pre-defined metrics for classification/regression problems, but in this case we continue with just accuracy for classification and MSE for regression.
Threshold: let's assume we use 0.5 as the threshold for prediction.
precision recall f1-score support
happy 0.65 0.64 0.65 746
sad 0.47 0.54 0.50 559
anger 0.59 0.30 0.39 613
surprise 0.00 0.00 0.00 213
disgust 0.60 0.55 0.57 545
fear 0.00 0.00 0.00 76
micro avg 0.58 0.46 0.51 2752
macro avg 0.38 0.34 0.35 2752
weighted avg 0.52 0.46 0.48 2752
samples avg 0.48 0.43 0.43 2752
precision recall f1-score support
happy 0.64 0.64 0.64 746
sad 0.47 0.54 0.50 559
anger 0.60 0.30 0.40 613
surprise 0.00 0.00 0.00 213
disgust 0.59 0.53 0.56 545
fear 0.00 0.00 0.00 76
micro avg 0.57 0.45 0.51 2752
macro avg 0.38 0.33 0.35 2752
weighted avg 0.52 0.45 0.47 2752
samples avg 0.48 0.43 0.43 2752
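The thresholding described above (binarizing the per-emotion scores at 0.5 before computing the multi-label report) can be sketched as follows; the score values are illustrative:

```python
import numpy as np

EMOTIONS = ["happy", "sad", "anger", "surprise", "disgust", "fear"]

def binarize(scores, threshold=0.5):
    """Turn per-emotion regression scores into multi-label 0/1 predictions."""
    return (np.asarray(scores) >= threshold).astype(int)

scores = [0.72, 0.10, 0.55, 0.40, 0.61, 0.05]
pred = binarize(scores)
predicted = [e for e, p in zip(EMOTIONS, pred) if p]  # emotions above threshold
```

These binary vectors are what the precision/recall/f1 rows above are computed from.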
CMU Classification:

Classification report
precision recall f1-score support
Anger 0.15 0.49 0.24 503
Disgust 0.05 0.28 0.08 140
Fear 0.04 0.15 0.07 102
Happy 0.60 0.52 0.56 4031
Neutral 0.48 0.18 0.26 3480
Sad 0.23 0.31 0.26 1040
Surprise 0.00 0.04 0.01 49
accuracy 0.36 9345
macro avg 0.22 0.28 0.21 9345
weighted avg 0.47 0.36 0.38 9345
Windowing and post-processing for the CMU data (HAPPY, ANGRY, SAD, NEUTRAL):
This script trains on the windowed CMU files in the directory. The windowed data is stored as PyArrow datasets. For the emotions considered here (HAPPY, ANGRY, SAD, DISGUST, SURPRISE, FEAR), the label sum must be 1, meaning only single-emotion data points are considered. The results thus obtained consider regression values.
SCRIPT PATH: Shreyah_code/MOSEI/Wav2vec/Audio/src/Loaded_data_Windowed-CMU_classification_H_A_S_N-win_postproc.ipynb
- RESULTS BEFORE POST-PROCESSING ON TEST:

- Classification report
precision recall f1-score support
Anger 0.62 0.09 0.15 1600
Happy 0.63 0.53 0.58 3462
Neutral 0.22 0.31 0.26 1324
Sad 0.30 0.61 0.40 1361
accuracy 0.41 7747
macro avg 0.44 0.38 0.35 7747
weighted avg 0.50 0.41 0.40 7747

- RESULTS AFTER POST-PROCESSING ON TEST:

- Classification report
precision recall f1-score support
Anger 0.39 0.08 0.13 216
Happy 0.76 0.60 0.67 1586
Neutral 0.35 0.33 0.34 691
Sad 0.21 0.59 0.31 306
accuracy 0.49 2799
macro avg 0.43 0.40 0.36 2799
weighted avg 0.57 0.49 0.51 2799
Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. Google is leveraging BERT to better understand user searches.
BERT has superior performance on the task of emotion recognition from text because the model learns the text representation from both directions, gaining a better sense of context and relationships, unlike earlier models that read the data only left-to-right or right-to-left. It takes the input sentence and learns its representation bidirectionally. The trained bidirectional transformer language models capture relationships and language context much more accurately.
BERT has a constraint on the maximum length of a sequence after tokenization. For any BERT model, the maximum sequence length after tokenization is 512. We set a sequence length of 64, since the token-count distribution for the IEMOCAP/CMU data falls within this range.
Input data needs to be prepared in a special way. BERT uses WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary. Two special tokens are introduced in the text: [CLS] at the start of every sequence and [SEP] at the end of each sentence.
BERT's model architecture is a multi-layer bidirectional Transformer encoder. For the fine-tuning on IEMOCAP and CMU-MOSEI data I used the BERT "base" model, which has 12 Transformer blocks (layers), 12 self-attention heads, and a hidden size of 768.
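The length constraint can be sketched as a simple truncate-and-pad step over WordPiece tokens. This is a toy helper for illustration, not the Hugging Face tokenizer API:

```python
def prepare_tokens(wordpieces, max_length=64, pad="[PAD]"):
    """Truncate/pad a WordPiece sequence and add BERT's special tokens."""
    # Reserve two slots for the special tokens [CLS] and [SEP].
    body = wordpieces[: max_length - 2]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    attention_mask = [1] * len(tokens)
    # Pad to the fixed sequence length expected by the model.
    while len(tokens) < max_length:
        tokens.append(pad)
        attention_mask.append(0)
    return tokens, attention_mask
```

Sequences longer than 62 WordPieces are truncated; shorter ones are padded, with the attention mask marking the real tokens.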
The text data (sentences) extracted from the IEMOCAP dataset is passed to a BERT model:
The IEMOCAP script to extract sentences is located in: Shreyah_code/IEMOCAP/Audio/Preprocessing_script.ipynb. Running the script produces the files processed_tran.csv and processed_label.txt, which contain the SessionID, the dialogue, and the labels extracted from the transcripts.
Training IEMOCAP text on BERT:
The script to train the text model with BERT on IEMOCAP for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch code) is in: Shreyah_code/IEMOCAP/Text/IEMOCAP-Text_4_emo.ipynb
IEMOCAP Classification:

Classification report
precision recall f1-score support
angry 0.52 0.73 0.61 78
happy and excited 0.83 0.56 0.67 236
sad 0.61 0.58 0.59 138
neutral 0.55 0.70 0.62 192
accuracy 0.63 644
macro avg 0.63 0.64 0.62 644
weighted avg 0.66 0.63 0.63 644
The script to train the text model with BERT on IEMOCAP for 3 emotions (ANGRY, SAD, HAPPY) classification (PyTorch code) is in: Shreyah_code/IEMOCAP/Text/IEMOCAP-Text_4_emo.ipynb

Classification report
precision recall f1-score support
angry 0.63 0.85 0.72 78
happy and excited 0.88 0.75 0.81 236
sad 0.71 0.75 0.73 138
accuracy 0.77 452
macro avg 0.74 0.78 0.75 452
weighted avg 0.78 0.77 0.77 452
- train_features_timestamp.csv
- val_features_timestamp.csv
- test_features_timestamp.csv
Once these pre-processed files are loaded into a dataframe, the data looks like the table below:
As CMU is a multilabel, multiclass dataset, only data points containing exactly one label at a given instance of time are considered. Data points with more than one emotion are discarded during training, and data points where no emotion exists are treated as Neutral. This procedure is followed for the CMU data to train BERT for emotion classification.
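This filtering rule can be sketched as follows; the column names are hypothetical, while the real notebooks use the CMU-MOSEI label format:

```python
EMOTIONS = ["happy", "sad", "anger", "disgust", "surprise", "fear"]

def single_label_or_neutral(row):
    """Keep rows with exactly one active emotion; all-zero rows become neutral."""
    active = [e for e in EMOTIONS if row.get(e, 0) > 0]
    if len(active) == 1:
        return active[0]
    if len(active) == 0:
        return "neutral"
    return None  # more than one emotion -> discarded from training

rows = [
    {"happy": 1},             # kept as "happy"
    {"sad": 1, "anger": 1},   # discarded (two active emotions)
    {},                       # kept as "neutral"
]
labels = [single_label_or_neutral(r) for r in rows]
kept = [l for l in labels if l is not None]
```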
Training CMU-MOSEI text on BERT:
The script to train the text model with BERT on CMU-MOSEI for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch code) is: Shreyah_code/MOSEI/Text/Emotion/Bert/Text_Preprocessing_Bert_Multiclass_4_emo.ipynb

Classification report
precision recall f1-score support
Happy 0.63 0.66 0.65 2014
Sad 0.32 0.24 0.27 594
Neutral 0.36 0.47 0.41 1338
Anger 0.41 0.12 0.19 537
accuracy 0.48 4483
macro avg 0.43 0.37 0.38 4483
weighted avg 0.48 0.48 0.47 4483
The results show that the test set performs well only for Happy and Neutral. To improve the results for Sad and Anger, data from other sources such as IEMOCAP and SMILE (shown below) was combined to balance the emotions and to see if there was any improvement on the test set.
The training data before adding IEMOCAP's Sad and Anger:
Anger 1590
Location: Shreyah_code/MOSEI/Text/Emotion/Bert/Text%20_Preprocessing_Bert_Multiclass_4emo_Emocap_MOSEI.ipynb

Classification report
precision recall f1-score support
Anger 0.43 0.15 0.22 487
Happy 0.62 0.73 0.67 1955
Sad 0.28 0.17 0.21 539
Neutral 0.37 0.42 0.39 1257
accuracy 0.50 4238
macro avg 0.42 0.37 0.37 4238
weighted avg 0.48 0.50 0.48 4238
This script combines IEMOCAP with the CMU data in order to correct the emotion imbalance for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch code). After adding the IEMOCAP data, the same CMU test data was used for evaluation. The data distribution after adding IEMOCAP:

The training data after adding IEMOCAP's and SMILE's Sad and Anger emotions:
Anger 5420
Classification report
precision recall f1-score support
Anger 0.40 0.14 0.21 487
Happy 0.60 0.77 0.68 1955
Sad 0.28 0.10 0.14 539
Neutral 0.38 0.42 0.40 1257
accuracy 0.51 4238
macro avg 0.41 0.36 0.36 4238
weighted avg 0.47 0.51 0.47 4238
The RoBERTa model (Liu et al., 2019) introduces some key modifications above the BERT MLM (masked-language modeling) training procedure. The authors highlight “the importance of exploring previously unexplored design choices of BERT”. Details of these design choices can be found in the paper’s Experimental Setup section.
RoBERTa is trained on BookCorpus (Zhu et al., 2015), amongst other datasets. A recently published work BerTweet (Nguyen et al., 2020) provides a pre-trained BERT model (using the RoBERTa procedure) on vast Twitter corpora in English. They argue that BerTweet better models the characteristic of language used on the Twitter subspace, outperforming previous SOTA models on Tweet NLP tasks. Hence, it is a good indicator that the performance on downstream tasks is greatly influenced by what our LM captures!
Similarly, RoBERTa has achieved benchmark results on the IEMOCAP data: https://paperswithcode.com/sota/emotion-recognition-in-conversation-on.
Therefore I am using RoBERTa to analyse the improvement on the CMU data.
RoBERTa in the Transformers library: the 🤗 Transformers library comes bundled with classes & utilities for applying various tasks to the RoBERTa model.
- train_features_timestamp.csv
- val_features_timestamp.csv
- test_features_timestamp.csv
Once these pre-processed files are loaded into a dataframe, the data looks like the table below:
As CMU is a multilabel, multiclass dataset, only data points containing exactly one label at a given instance of time are considered. Data points with more than one emotion are discarded during training, and data points where no emotion exists are treated as Neutral. This procedure is followed for the CMU data to train RoBERTa for emotion classification.
Training CMU-MOSEI text on RoBERTa:
The script to train the text model with RoBERTa on CMU-MOSEI for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch-Lightning code) is:
Shreyah_code/MOSEI/Text/Emotion/RobertaBert/Text%20Preprocessing-RobertaBert_4_emotions.ipynb

Classification report
precision recall f1-score support
Happy 0.6461 0.8613 0.7383 1651
Sad 0.2661 0.0971 0.1422 340
Anger 0.0000 0.0000 0.0000 227
Neutral 0.3708 0.3190 0.3429 765
accuracy 0.5696 2983
macro avg 0.3208 0.3193 0.3059 2983
weighted avg 0.4830 0.5696 0.5128 2983
The results show that the test set performs well only for Happy and Neutral; Anger and Sad show poor performance. To improve the results for Sad and Anger, data from other sources such as IEMOCAP and SMILE (shown below) was combined to balance the emotions and to see if there was any improvement on the test set.
This script combines IEMOCAP with the CMU data in order to correct the emotion imbalance for 4 emotions (ANGRY, SAD, HAPPY, NEUTRAL) classification (PyTorch code). After adding the IEMOCAP data, the same CMU test data was used for evaluation. The data distribution after adding IEMOCAP:
_Shreyah_code/MOSEI/Text/Emotion/RobertaBert/Roberta_4emotions_Emocap_MOSEISMILE.ipynb

The training data after adding IEMOCAP's and SMILE's Sad and Anger emotions:
Anger 5406
Classification report
precision recall f1-score support
Happy 0.380000 0.078029 0.129472 487
Sad 0.589490 0.791816 0.675835 1955
Anger 0.290698 0.092764 0.140647 539
Neutral 0.365672 0.389817 0.377358 1257
accuracy 0.501652 4238
macro avg 0.406465 0.338107 0.330828 4238
weighted avg 0.461031 0.501652 0.456456 4238
Only the final classification layer is trained.
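A minimal sketch of training only the classification head, written with PyTorch as in the fusion scripts. The model below is a toy stand-in; the actual fused text+audio network lives in the notebook referenced below, and the feature sizes here are illustrative:

```python
import torch.nn as nn

# Toy stand-in for a fused text+audio encoder with a classification head.
model = nn.Sequential(
    nn.Linear(768 + 256, 128),  # pretend fused feature size (text dim + audio dim)
    nn.ReLU(),
    nn.Linear(128, 4),          # 4 emotion classes
)

# Freeze every parameter...
for p in model.parameters():
    p.requires_grad = False
# ...then unfreeze only the final classification layer.
for p in model[-1].parameters():
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Passing only the still-trainable parameters to the optimizer then updates just the head while the frozen encoder weights stay fixed.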
Audio :
Pre-processing: refer to Section 4.3 for further details on IEMOCAP and the pre-processing steps.
Fusion Script for Bert based Text and Alexnet based audio model fusion : Shreyah_code/IEMOCAP/Audio%2BText/text%2Baudio_ash.ipynb

Classification report
precision recall f1-score support
angry 0.57 0.69 0.62 75
happy 0.82 0.62 0.70 214
sad 0.59 0.69 0.64 136
neutral 0.58 0.62 0.60 172
accuracy 0.64 597
macro avg 0.64 0.65 0.64 597
weighted avg 0.66 0.64 0.65 597
Classification report
precision recall f1-score support
angry 0.53 0.81 0.64 78
happy 0.86 0.63 0.73 236
sad 0.71 0.73 0.72 138
neutral 0.62 0.68 0.65 192
accuracy 0.69 644
macro avg 0.68 0.71 0.68 644
weighted avg 0.72 0.69 0.69 644